> To: Fault@winternet.com, Summary@winternet.com, bugtraq@crimelab.com
> Subject: Seg

Methinks perhaps you forgot the quotes around the subject :-)

> For general background, a segmentation fault occurs when an
> "unprivileged" process accesses a memory address which is not in its
> address space or tries to write to memory which has been marked
> read-only.

Well, mostly. A segmentation fault in the sense of something that
generates a SIGSEGV is only a subset of those; if a process tries to
access nonexistent virtual memory just off the end of its stack, the
kernel will normally just grow the stack instead. And when
copy-on-write memory is created (e.g., via fork() or mmap()), the
memory is normally set up as read-only, and a write access causes the
kernel to transparently create a read/write copy of the page and let
the process write to that.

Also, you imply that it is possible for a process to be privileged in a
way that allows it to make such memory accesses. This is not the case
in any system I have ever heard of.

> My question asked how such a scheme was implemented. Specifically,
> it asked if hardware support was needed to implement such a scheme.
> I asked the question, because I did not understand how the kernel,
> being just another process (not hardware), could enforce memory
> restriction on another process, when at the time the kernel is not
> even executing.

The kernel is not really "just another process". "Process" is a
software notion, created and maintained by the kernel. The kernel is
special because it (usually) executes in supervisor mode (see my next
paragraph).

> The answer is, yes, hardware support is required.
> The cpu has what is called an MMU (Memory Management Unit).

There is another critical notion: that of user mode and supervisor mode
(also called such things as kernel mode[%]).
On a modern machine (as in, one capable of running a multiuser system
like NetBSD), there is a mode bit that grants or denies certain
privileges, typically the ability to execute certain instructions. On
the 68020, for example, the MOVES and MOVEC instructions work only when
they are executed in supervisor mode.

[%] There is at least one machine - the VAX - where there are four
    modes (kernel, executive, supervisor, and user), all different. For
    our purposes, and for all UNIX derivatives I know of, two modes are
    all that are used. I'll continue to speak of the privileged mode
    as "supervisor" and the nonprivileged mode as "user"; on the VAX,
    what I am calling "supervisor mode" is what the hardware docs speak
    of as "kernel mode".

> This unit keeps track (in its own private memory?)

Yes, essentially. It usually is not organized in the typical
von Neumann address space way, but it nevertheless is essentially a
small amount of memory that's private to the MMU hardware.

> of virtual memory addressing, memory ownership, and maybe a few other
> things I'm not aware of.

The MMU normally keeps track of virtual-to-physical mappings, which are
really just lookup tables, and protection. Every memory access from
user mode must pass the protection checks set up in the MMU (or the
access is denied and an exception occurs) and then the lookup table
turns it into a physical address. (To reduce the size of these lookup
tables, they normally work on just the high N bits of the address,
where N varies from one machine to another, sometimes from one variant
to another, with the low bits passed through unchanged.) On many
machines, the MMU affects supervisor-mode accesses as well, and there
are more protection bits, to permit setting up memory that user mode
can access, that supervisor mode can access but user mode can't, or
that not even supervisor mode can access. (This is a simplification,
since there are at least three common types of access[%].
The last kind may sound useless, but it's valuable as an aid to
catching bugs in the kernel.)

[%] Read (read as data), write (write as data), and execute (read for
    instruction execution); some hardware makes no distinction between
    read and execute.

> The kernel process is given special privileges by the CPU/MMU to read
> and write to these memory tables.

This is usually done by ensuring that the hardware-provided mechanisms
to read and write the MMU setup are accessible to supervisor mode but
not user mode.

> It assigns an address space to a process. When that process attempts
> to access memory it isn't supposed to, the MMU interrupts the process,
> swaps it out to someplace, executes the kernel code (that was
> previously setup to be by the kernel) to handle the page fault.

Right in outline, wrong in some details. The MMU just notices the
attempted access and causes the CPU to take a memory addressing trap.
The MMU does not "swap [] out" the user process; if that is done, it's
done by the kernel. The trap amounts to little more than "save the CPU
state somewhere", "set mode to supervisor", and "jump to the memory
management exception handler". (The hardware locates the address for
this jump typically by looking in a table set up by the kernel at boot
time, a table specified in a way that can be used only by supervisor
mode.)

> This normally results in the kernel sending a SIGSEGV to the
> "malicious process".

Yes, since you started with a process attempting to "access memory it
isn't supposed to". But the same mechanisms are used to provide
virtual memory, that is, more memory than actually exists on the
system. When this is being done, the MMU is set up so that the memory
that exists only virtually (that is, not in physical memory but
actually out on disk somewhere) is no-access in the MMU.
Then when the user process tries to access it and takes the MMU
exception trap, the kernel notices that the access was to a valid
address and fetches the required page off disk into some physical
memory page, fiddles the MMU to point to that physical page and resets
the protections to what they should be, and lets the access happen.

> The exploit script posted with the assembly source was a way of
> subverting this mechanism in a way I still don't fully understand. A
> more detailed understanding of the above process specific to the
> machine and kernel where the bug exists/existed is necessary.

Right.

> With my current understanding and assuming there is no bug in the
> hardware itself, I don't understand how the exploit script is able to
> overwrite any kernel memory. If at the point where it tries to
> overflow some buffer (where is this buffer? In the hardware?) it is
> stopped by the MMU, how then does it actually get to overwrite kernel
> memory?

I'll explain below.

> Does it actually get to access the memory, THEN the kernel is told
> that a fault has occurred, and then due to the bug the kernel
> doesn't clean things up properly?

A reasonable guess, but no, that's not what's going on.

> I guess a start might be to know what the asm instructions "restore"
> and "save" do.

The SPARC architecture involves something called "register windows".
There are at any moment 32 registers accessible, divided into four
groups of eight: the "ins" %i0-%i7, the "outs" %o0-%o7, the "locals"
%l0-%l7, and the "globals" %g0-%g7. Conceptually, there is an infinite
chain of "windows" of registers, with the ins of each window being the
same as the outs of the adjacent window (the globals are common to all
windows):

     +------- window #N -------+             +------ window #N+2 ------+
    /                           \           /                           \
    %i0...%i7 %l0...%l7 %o0...%o7           %i0...%i7 %l0...%l7 %o0...%o7
... %o0...%o7           %i0...%i7 %l0...%l7 %o0...%o7           %i0...%i7 ...
            /           \                           /           \
... -------+             +------ window #N+1 ------+             +------- ...
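
To make the overlap in the diagram concrete, here is a minimal sketch
in Python. Everything in it is an illustrative assumption rather than
anything taken from real hardware or from the posted code: the
function name physical_register, the flat numbering of the conceptual
register file, and the choice to give each window 16 registers of its
own (ins, then locals) while its outs are borrowed from the ins of the
next window.

```python
GROUP = 8  # registers per group: ins, locals, outs, globals


def physical_register(window, name):
    """Resolve a name like '%o3' as seen from conceptual window `window`.

    Hypothetical numbering, for illustration only: globals live in their
    own bank; every window owns 16 registers (ins, then locals); its outs
    are simply the ins of window + 1.
    """
    if len(name) != 3 or name[0] != "%" or name[2] not in "01234567":
        raise ValueError(f"bad register name: {name!r}")
    kind, number = name[1], int(name[2])
    if kind == "g":
        return ("global", number)  # shared by every window
    base = 2 * GROUP * window
    if kind == "i":
        return ("windowed", base + number)
    if kind == "l":
        return ("windowed", base + GROUP + number)
    if kind == "o":
        # Same slot as the ins of the next window in the chain.
        return ("windowed", base + 2 * GROUP + number)
    raise ValueError(f"bad register name: {name!r}")
```

With this numbering, physical_register(5, "%o2") and
physical_register(6, "%i2") name the same slot, while the locals of
the two windows stay distinct and the globals do not depend on the
window at all.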
At any given time, exactly one window is current. All register
accesses use the current window to determine which physical register is
accessed for a given register number. A "save" instruction shifts from
using window N to using window N+1; a "restore" shifts the other way.

Normally, a procedure call puts values in %o0 through %o7; the "save"
makes the calling stack frame's registers inaccessible, and at the same
time shifts those values to %i0 through %i7 (or more precisely, it
shifts names so that the name %i0 refers to the same register that the
name %o0 used to refer to, and similarly for the other seven). The
called procedure then has its own %l0...%l7 and %o0...%o7 to use as it
pleases; when it's done, a "restore" shifts back to the caller's
window, with %i0...%i7 being shifted back to %o0...%o7.

However, of course the hardware doesn't have an infinite set of
registers. It actually has enough registers for some small number of
windows (typically 8), plus a register (which I'm not sure is even
accessible) saying which of those is the current window, and a
(privileged) register saying which of those windows are accessible. If
a user process does a save or restore that would cause it to shift into
an inaccessible window, a trap is taken in a way similar to the MMU
traps I described above: the CPU shifts to supervisor mode and jumps to
the handler set up by the kernel.

It is the responsibility of this handler to arrange for the desired
window to be accessible. In the case of a save instruction (a window
overflow trap), this means dumping one of the full windows to the
stack; for a restore instruction (a window underflow trap), it means
reading the desired window off the stack into the registers. (It would
be possible for the trap handler to write or read more than one window
at a time, but this normally isn't done because studies have shown it
to almost always be counterproductive.)
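
The save/restore and overflow/underflow behaviour described above can
be sketched as a toy model in Python. This is only a sketch under
stated assumptions, not how any real kernel or CPU is written: the
names WindowedCPU and nested_calls are made up for illustration, the
"stack" is just a Python list of dumped windows with no addresses
involved, windows are numbered the conceptual way used in the text
(save goes from N to N+1), and I assume one slot of the ring is always
kept free because the current window's outs live there.

```python
class WindowedCPU:
    """Toy model of a finite ring of register windows (illustrative only)."""

    def __init__(self, nwindows=8):
        self.nwindows = nwindows
        # Each slot holds one window's 8 ins followed by its 8 locals; a
        # window's outs are the ins of the next slot around the ring.
        self.slots = [[0] * 16 for _ in range(nwindows)]
        self.cur = 0        # conceptual number of the current window
        self.oldest = 0     # oldest window still held in registers
        self.stack = []     # windows dumped to memory by the overflow handler
        self.overflow_traps = 0
        self.underflow_traps = 0

    def _slot(self, window):
        return self.slots[window % self.nwindows]

    def save(self):
        # Assumption: at most nwindows - 1 windows are resident at once.
        if self.cur + 1 - self.oldest >= self.nwindows - 1:
            self._overflow_trap()
        self.cur += 1

    def restore(self):
        if self.cur == 0:
            raise RuntimeError("restore with no caller window")
        if self.cur - 1 < self.oldest:
            self._underflow_trap()
        self.cur -= 1

    def _overflow_trap(self):
        # "Supervisor mode" handler: dump the oldest resident window.
        self.overflow_traps += 1
        self.stack.append(list(self._slot(self.oldest)))
        self.oldest += 1

    def _underflow_trap(self):
        # "Supervisor mode" handler: read the wanted window back in.
        self.underflow_traps += 1
        self.oldest -= 1
        self._slot(self.oldest)[:] = self.stack.pop()

    def set_out(self, n, value):
        self._slot(self.cur + 1)[n] = value   # outs = next window's ins

    def get_in(self, n):
        return self._slot(self.cur)[n]

    def set_local(self, n, value):
        self._slot(self.cur)[8 + n] = value

    def get_local(self, n):
        return self._slot(self.cur)[8 + n]


def nested_calls(cpu, depth):
    """Call `depth` levels deep, then unwind, returning each level's local."""
    for level in range(1, depth + 1):
        cpu.set_out(0, level)                 # caller passes an argument
        cpu.save()                            # callee gets a fresh window
        cpu.set_local(0, cpu.get_in(0) * 10)  # argument arrives in %i0
    seen = []
    for _ in range(depth):
        seen.append(cpu.get_local(0))
        cpu.restore()
    return seen
```

Running nested_calls(WindowedCPU(), 3) never traps, since everything
fits in the ring. Going 20 levels deep with 8 windows forces the
overflow handler to dump windows on the way down and the underflow
handler to reload them on the way back up, and each level still finds
its own local value intact, which is the whole point of the handlers.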

To support this spilling and refilling, two of the registers (%i6 and
%o6) are reserved by software convention to keep track of the stack;
they are used by the trap handlers to determine where to store/load
windows to/from.

Now the stage is set, and I can explain the bug the posted code is
trying to take advantage of. The window underflow and overflow trap
handlers run in supervisor mode, as they must since they have to read
and write the %wim (window invalid mask) register. The bug is that
they don't check that the user process has access to the place they
need to read a window from (for the underflow trap) or write a window
to (for the overflow trap). If the user process has trashed %sp or %fp
(whichever one the handler uses), this code will attempt to access
somewhere illegal.

The posted code simply makes the relevant register point to some memory
that supervisor mode can read/write but user mode can't, then takes a
window overflow or underflow trap (depending on whether it wants to
write or read). If the register pointed to memory that not even
supervisor mode could access, the window trap code would get an invalid
access and it would either kill the user process or panic the system,
depending on how carefully the MMU trap handler checks things. But as
it is, the window save/restore code doesn't notice that while it,
running in supervisor mode, can access that memory, user mode can't, so
it happily reads or writes it and then returns control to user mode the
way it normally does; the user-mode code then cleans up the %sp/%fp
registers so the stack is back where it belongs.

Of course, the trap handler _should_ check the memory protection before
reading or writing, and in non-buggy systems it does.

If anyone is interested, I'll be glad to correspond further on any
points which still remain unclear.

                                        der Mouse

                            mouse@collatz.mcrcim.mcgill.edu